Adapting Lexical and Language Models for Transcription of Highly Spontaneous Spoken Czech

Authors

  • Jan Nouza
  • Jan Silovský
Abstract

The paper deals with the problem of automatic transcription of spontaneous conversations in Czech. This type of speech is informal and contains many colloquial words. It is difficult to create an appropriate lexicon and language model when the linguistic resources representing colloquial Czech are limited to several small corpora collected by the Institute of Czech National Corpus. To overcome this, we introduce transformations between the most frequent colloquial words and their counterparts in formal Czech. This allows us a) to combine the small spoken corpora with much larger corpora of more formal texts, b) to optimize the recognizer’s lexicon, and c) to solve the data sparsity problem when computing a probabilistic language model. We have applied this approach in the design of a system for transcription of spontaneous telephone conversations. Its recent version operates with an accuracy of about 48%, and the proposed transformations together with corpora mixing contributed a 9% improvement over the baseline system.
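As a rough illustration of the idea described in the abstract, the sketch below maps colloquial word forms to formal counterparts before pooling a small spoken corpus with a larger formal one and counting bigrams. It is not the authors' implementation: the mapping table, the toy sentences, and the function names `normalize` and `bigram_counts` are all assumptions made for this example, and the abstract does not say in which direction or at which stage the actual system applies its transformations.

```python
from collections import Counter

# Hypothetical colloquial -> formal mapping (illustrative only; the paper's
# actual transformation list is not given in the abstract).
COLLOQUIAL_TO_FORMAL = {
    "dobrej": "dobrý",
    "bejt": "být",
    "vokno": "okno",
}


def normalize(tokens, mapping=COLLOQUIAL_TO_FORMAL):
    """Replace each colloquial token with its formal counterparts, if known."""
    return [mapping.get(token, token) for token in tokens]


def bigram_counts(sentences):
    """Count bigrams over tokenized sentences, with sentence boundary markers."""
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


# Toy corpora (assumed data): one small "spoken" corpus, one larger "formal" one.
spoken = [["to", "je", "dobrej", "nápad"]]
formal = [["to", "je", "dobrý", "nápad"], ["to", "je", "okno"]]

# After normalization both corpora share one vocabulary, so their counts pool
# instead of being split between a colloquial and a formal variant.
mixed = [normalize(sentence) for sentence in spoken] + formal
counts = bigram_counts(mixed)
```

In this toy setup the colloquial form and its formal counterpart contribute to the same bigram, which is the kind of sparsity reduction the abstract attributes to the transformations.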


Similar resources

Large vocabulary ASR for spontaneous Czech in the MALACH project

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech...


Sub-lexical Dialogue Act Classification in a Spoken Dialogue System Support for the Elderly with Cognitive Disabilities

This paper presents a dialogue act classification for a spoken dialogue system that delivers necessary information to elderly subjects with mild dementia. Lexical features have been shown to be effective for classification, but the automatic transcription of spontaneous speech demands expensive language modeling. Therefore, this paper proposes a classifier that does not require language modelin...


Spoken Malay Language Influence on Automatic Transcription and Segmentation

The influence of Malay language into modeling a Malay speech lexicon can be potentially useful for a more accurate transcription and segmentation. The problem arises when trying to discriminate the boundaries between similar sounding phonemes for segmentation, especially in dyslexic children’s speech when reading, which have been influenced by the surrounding phonemes (before and after) thus ma...


Partial Parsing of Spontaneous Spoken French

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing is based on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the t...


Adapting lexical representation and OOV handling from written to spoken language with word embedding

Word embeddings have become ubiquitous in NLP, especially when using neural networks. One of the assumptions of such representations is that words with similar properties have similar representation, allowing for better generalization from subsequent models. In the standard setting, two kinds of training corpora are used: a very large unlabeled corpus for learning the word embedding representat...



Publication date: 2010